
Jared Kaplan


Constitutional Classifiers++: Efficient Production-Grade Defenses against Universal Jailbreaks

Jan 08, 2026

Reasoning Models Don't Always Say What They Think

May 08, 2025

Forecasting Rare Language Model Behaviors

Feb 24, 2025

Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming

Jan 31, 2025

Clio: Privacy-Preserving Insights into Real-World AI Use

Dec 18, 2024

Alignment faking in large language models

Dec 18, 2024

Sabotage Evaluations for Frontier Models

Oct 28, 2024

Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models

Jun 17, 2024

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Jan 17, 2024

Evaluating and Mitigating Discrimination in Language Model Decisions

Dec 06, 2023